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ABSTRACT 

In this paper, we demonstrate that it is possible to automatically 
generate fingerprints for various web server types using multifactor 
Bayesian inference on randomly selected servers on the Internet, 
without building an a priori catalog of server features or behav- 
iors. This makes it possible to conclusively study web server dis- 
tribution without relying on reported (and variable) version strings. 
We gather data by sending a collection of specialized requests to 
110,000 live web servers. Using only the server response codes, 
we then train an algorithm to successfully predict server types in- 
dependently of the server version string. In the process, we note 
several distinguishing features of current web infrastructure. 

1. INTRODUCTION 

One of the fundamental tactics used by both attackers and defend- 
ers is to understand what specific systems might be under attack 
or require defense. Individual point releases will have known soft- 
ware bugs for which off-the-shelf exploits may already be avail- 
able. [7 1 While web servers often need to know a browser's spe- 
cific user agent, e.g., to deliver content that works around known 
bugs or missing features of a given web client, web clients have no 
particular reason to know the exact version of a web server. This 
leads to a common mitigation, where a web server administrator 
will deliberately obfuscate version strings. When a server gives 
commodore64-HTTPD/l . 1 or ' ; DROP TABLE server- 
types; - as its version string, this behavior is obvious, but it 
is difficult to measure the frequency at which such modifications 
occur, or the true types of servers with obfuscated names. 

Historically, extensive work has been done on fingerprinting oper- 
ating system kernels [3], and traffic classification 1111 . often mak- 
ing use of probabilistic techniques [15], and on fingerprinting web 
browser configurations 1211 . among other areas. While various 
commercial applications use known distinguishing features of dif- 
ferent web servers to identify server types and versions 0161 , sur- 
prisingly little academic work has been conducted on the topic of 
web server fingerprinting, and much of what has been done has 
focused on detecting or preventing fingerprinting [20]. 



In our research, rather than manually cataloging every possible 
quirk, we wish to take a machine learning approach, sending a vari- 
ety of carefully crafted requests to large numbers of web servers on 
the Internet, and training our system to make subsequent identifi- 
cations. We take advantage of the fact that many installations don't 
obfuscate their published version information, allowing us to train 
our models with remarkable accuracy. 

2. DESIGN AND METHODOLOGY 

Our central design principle was to determine if variations in web 
server behavior made it possible to determine server type (e.g. Apache) 
and version (e.g. 2.2.3) without a manual analysis of individual 
pieces of software. We undertook to do this by surveying a broad 
selection of running web servers and comparing their responses to 
various requests. In the process, we gathered information on some 
behaviors and features that might be related to potential exploits 
either against either the servers themselves or against browsers ac- 
cessing the servers. 



The HTTP Protocol. Knowledge of the HTTP protocol is nec- 
essary to understand web server behavior and variations. First doc- 
umented with version 0.9 in 1991, the HTTP protocol is now stan- 
dardized on version 1.1, released by the Internet Engineering Task 
Force in its RFC 2068 of January 1997 0, and updated with RFC 
2616 in June of 1999 |6|. While the basic format of the protocol is 
relatively straightforward, there is significant opportunity for differ- 
ent implementations of the protocol to behave differently, particu- 
larly in selecting among the 41 different response codes included in 
the protocol. Furthermore, many request and response headers are 
optional, and their ordering is not specified by the protocol. This 
means that different server implementations will have measurably 
different behavior that can be used for the purpose of classification, 
even when the question of incomplete and incorrect implementa- 
tions is set aside. These variations, described in part below, make it 
possible to identify server types and versions by the various ways 
the protocol is implemented. 

In designing our experimental methodology, it was essential not to 
compromise or damage any of the servers that we studied. There- 
fore we chose a selection of correct, if somewhat unusual, HTTP 
requests, and analyzed the responses. We chose requests that would 
enable us to gather identifying information about the servers — not 
simply the basic HTTP version string that nominally represents the 
server type and version, but also a set of responses that should en- 
able the identification of the server based on its behavior, and not 
only on its reported type. In designing our experimental methodol- 
ogy, we also sought to obtain aggregate data about large numbers 



of servers, as opposed to a detailed understanding of the security 
profile of a few selected sites, and so we chose an approach that in- 
volved crawling a large number of sites and cataloging responses, 
rather than a manual inspection of a few hosts. 

2.1 Bayesian Classification 

Bayes' rule is a basic principle of probability that allows one to cal- 
culate the probability of a given hypothesis given a certain datum, 
when one knows the correlation between the datum and the hy- 
pothesized conclusion in the universe being considered. Consider 
the common application of spam filtering, where Bayesian classi- 
fication has been found to be extremely effective, even in the face 
of dedicated attempts to defeat it fTl, If one considers the pres- 
ence of a certain term in an e-mail as the datum in question, as well 
as the frequency of that term in both spam and non-spam e-mails, 
then Bayes' rule makes it possible to classify the probability that a 
given e-mail is spam based on the presence of that term. Bayesian 
inference allows multiple observations of this type to be combined, 
potentially allowing for more accurate classification of the e-mail 
as spam or not. Bayesian techniques have been used effectively for 
intrusion detection 1 12], data mining |4|, and as an effective tech- 
nique in fields well beyond the scope of computer science (8j. 

Our application is somewhat more complex than spam filtering, as 
we need to classify servers as belonging to one of many categories 
of types, and not simply as a binary quantity (spam or not spam.) 
In doing so, we use Bayes' rule to calculate the probability of a 
given server belonging to each server type, and choose the type 
with the highest probability. (See Section \2A) This also enables 
us to associate a degree of certainty with our predictions, as we 
know the likelihood that our observed characteristics belong to our 
calculated server type (5). 

2.2 Selection of Requests 

In developing our experimental methodology, we selected 10 sep- 
arate requests to send to every surveyed server. A summary of the 
requests and their purposes is shown in Table [TJ While we begin 
with a standard HTTP GET to provide a baseline, the other re- 
quests are intended to identify features that may have different (or 
missing) implementations across different web servers. They were 
also chosen with the aim of receiving the same response regardless 
of site configuration. Thus, the only files that we access directly 
are files that should be present on most servers: the root URL, the 
robots . txt file, and the f avicon . ico file. While these files 
may not be present on every server, their presence on most servers 
enables a broad comparison across sites independent of site con- 
tent. 

2.3 Datasets 

We prepared three distinct data sets. The first consisted of re- 
sponses from the Alexa top 10,000 sites, the second consisted of re- 
sponses from the top 100,000, and the third consisted of responses 
from the last 10,000 sites in the Alexa top 1 million. These three 
data sets, representing a total of 1.2 million HTTP requests and re- 
sponses, allowed us to abstract behavior for different server types, 
and to compare the differences between the largest (and presum- 
ably most carefully maintained) sites and other sites that are likely 
not as actively maintained or uniquely configured. 

2.4 Prediction of Server Types 

Our work sought to provide a methodology to identify server types 
beyond their self reported version strings. With this in mind, we 



Request 


Rationale 


An ordinary get request 
against the root URL 


This request provided a "baseline." As 
the most ordinary request that any server 
would receive, we expected it to be han- 
dled in a straightforward manner with a 
200 response code. 


A partial get request of 50 
bytes against the root URL 


This request allowed us to test the 
server's implementation of the HTTP 
partial get feature. 


A conditional get request 
for pages modified after a 
future date against the root 
URL 


This request allowed us to test the 
server's implementation of the HTTP 
conditional get feature. 


A head request against the 
root URL 


This request allowed us to test the 
server's implementation of the HTTP 
head feature. 


An options request 


This request allowed us to test the 
server's implementation of the HTTP 
options feature. 


A trace request against the 
root URL 


This request allowed us to test the 
server's implementation of the HTTP 
trace feature. 


A request for the root URL 
as a CSS stylesheet 


This request is similar to a request that 
might be generated by a browser expe- 
riencing a cross site scripting attack. A 
modern browser should watch the con- 
tent type of the response, refusing to in- 
terpret the page as CSS if it is labeled as 
HTML. 


A request for robots.txt as 
a CSS stylesheet 


Similar to the previous request, it allows 
us to test mime type support on a differ- 
ent file type (text). 


A request for a relative 
URL above the root direc- 
tory 


This request is invalid and should be re- 
jected on any server. If it were hon- 
ored, it would give the client access to 
the server's entire file system. 


A request for the favicon 


This request allows us to survey another 
content type, again checking for proper 
MIME type support. 



Table 1: Requests used in measuring server behavior 



collected the server response codes from the requests above, and 
used them to calculate a distinct fingerprint for each type of web 
server. We could then match the responses of an unknown web 
server against the fingerprints that we had developed for various 
server types, enabling us to predict its type and version. 

We associated the individual fingerprints with specific server types 
and versions by using the version string provided with the responses. 
In doing so, we did not assume that the data in the version string 
was necessarily correct. Indeed, our premise was that it is often 
changed to report something other than the correct server type and 
version. However, we did assume that, as these changes are made 
by individual administrators acting independently, that there would 
not be a consistent incorrect server string for a given fingerprint. In 
this way, the incorrect responses, even if they constituted a signifi- 
cant percentage of the sample, could be filtered out as noise, while 
the single largest reported version string for any given fingerprint 
would be, in fact, the correct one. 

Our fingerprints took the form of a dataset listing the frequency of 
each response for each server type and request type. We generated 
the fingerprint dataset by training on the raw data in our primary 
dataset, calculating the probability of a given server version given 
that server's response to our requests using a standard Bayesian 
methodology, as discussed in Section lZTl 

P, i \ P(response(~]server) 
(response\server) = — — p( server ) 

Then, for each server in the set being tested, we sought to predict 
the probability of each server type. Assuming that the response 
to one request was independent from the response to a different 
request, we used Bayes' rule to combine the probabilities for each 
request for each server type as follows: 

P (server | f}™ =0 res P ons &i) = 

p P(server) Y[ P(response^ \server) 

P(server) Yl P (response^ \server)-\-P (server) Yl P(response^ | server) 

This gave us a probability for each server type given the total set 
of responses any given site returned. We then selected the most 
probable server type as our prediction, reporting both the server 
type and the probability that we had generated. 

2.5 Limitations 

Some attempts at disguising a server's type and version may go be- 
yond changing the version string. For example, a program called 
ServerMask by Port 80 Software changes the format of headers, 
cookies, and file extensions to make Microsoft IIS resemble a dif- 
ferent type of server. Because our techniques do not use those 
particular markers to fingerprint the server, those changes should 
not affect our success. However, software that manipulated the 
response codes could easily confuse our algorithm. Indeed, it is 
possible that some of the unusual response codes that we detected 
came from software seeking to do exactly that. Additionally, our 
methodology assumes that a single web address corresponds to a 
single server technology (which may or may not be distributed 
across multiple physical machines). However, in cases where dif- 
ferent requests are handled by different server technologies (as may 
happen, for example, when separate caching servers are used) our 
attempts at identifying a single server technology used on the site 
can, at best, identify only one of the technologies used. 




fered from the calculated one. (top 10k servers based on train- 
ing data from top 100k) 

3. RESULTS 

Having developed our techniques, we applied them to our data set, 
enabling us to validate our methods and discover several interesting 
behaviors of current web servers. 

3.1 Distribution of Server Types 

In order to determine the effectiveness of our technique, it was nec- 
essary to test it on real world data. In doing so, we gathered infor- 
mation on server type from raw version strings, processed it with 
our algorithms, and examined the results. 

3.1.1 Reported server types 

While the HTTP standard provides for the use of "Product Tokens" 
(section 3.8) in the "Server" field of the response header (section 
14.38) (6), many individual sites choose to modify their headers to 
exclude this information or replace it with irrelevant or incorrect 
data. Thus, we found server versions such as Nintendo, All 
your base are belong to us, and My Arse in addition 
to more standard strings such as Apache/2 .2.20 (Ubuntu) . 
This penchant for customization not only reflects the universal hu- 
man desire for self-expression, but an understandable desire to hide 
sensitive information from potential attackers and competitors. The 
reported server types which we observed are summarized in Ta- 
ble|2] 

3.1.2 Calculated server types 

In order to correct for this behavior, we developed an algorithm 
(described above) that used multifactor bayesian analysis to predict 
the actual server type based on responses. Our algorithm produced 
different results from the server type returned by the site for ap- 
proximately 38% of sites. Figure [T] shows the confidence levels of 
our predictions when our algorithm was trained on the top 100,000 
sites, and run on the top 10,000 sites. 

While for a number of servers, we were able to predict the server 
version with a very high confidence level, in other cases, it is not 
immediately clear whether the server was reporting incorrect data, 
or whether our algorithm incorrectly predicted the type of the server. 
As it was infeasible to contact the administrators of all servers in 
the sample to inquire as to the true server version, we relied upon 
the confidence levels produced by the Bayesian analysis. Even be- 
yond this, however, some circumstantial evidence does suggest that 



our algorithm correctly predicted the results in many cases . 

Apart from the fact that our analysis produced the same results as 
the server version string in 62% of cases, some of the divergent 
results also suggest correct behavior. For example, server versions 
which are derived from Apache, such as IBM HTTP Server and 
Apache Coyote, were recognized as Apache, even though they did 
not figure into the training data for Apache. Most sites with unusual 
and obviously false version strings were recognized as belonging to 
one of the major server types. 

A more interesting question concerns the cases where servers with 
apparently correct version strings were recognized as servers of a 
different type. Are these cases of web site administrators seeking to 
intentionally obfuscate their server selection, or cases of erroneous 
behavior on the part of our software? As seen above, we were able 
to generate a confidence estimate for each of our predictions, and 
many of these results fell in the 75% confidence range, suggesting a 
likelihood that we were frequently correct, and the reported version 
strings represented an intentional effort at obfuscation. 

3.1.3 Sites reporting multiple server types 

One phenomenon that affected our analysis was the case of a num- 
ber of sites that reported different server versions to different re- 
quests. Within the top 100,000 websites, we found that 6 servers 
gave 10 different version strings to our 10 requests, 84 gave 5 or 
more, and an astounding 24,254 sites gave at least two different 
server versions. One example is Verizon.com, which alternately re- 
ported itself to be running Apache, Microsoft-IIS, Oracle-iPlanet- 
Web-Server (the successor to Netscape Enterprise Server), Aka- 
maiGHost, or returned no version string at all! 

While we have no direct evidence explaining this phenomenon, our 
data does suggest several possible conclusions. For example, a 
number of servers reported no server version at all, until they were 
asked for their favicon.ico, at which time they reported that they 
were running Microsoft-IIS. This suggested to us that a bug or con- 
figuration error in Microsoft-IIS revealed the server version on that 
particular request when an operator had desired for it to be hidden. 

On other sites, it would appear that our requests were, indeed, being 
handled by a variety of servers of different types. It would seem 
that some common requests may have gone to a server handling 
static content, while the more unusual requests got forwarded to a 
different server which, in turn, returned a different server string. 
The caching software used by Akamai Global Host, in particular, 
seemed to produce this phenomenon. 

It is conceivable that some web servers were configured to return 
different version strings to different requests, perhaps in an effort 
to confuse observers. However, we have no conclusive evidence to 
support this hypothesis. 

3.1.4 Server Types 

After running our analysis, we were able to classify the most com- 
mon web servers found in our data set. The results are shown in 
Table[2] We give both the reported server types and the calculated 
server types for comparison. As can be seen, the 15% of servers 
which did not report a version have all been classified into one of 
the major versions. 

Many servers are re-classified as Apache, bringing the relatively 
low number of Apache servers found in the top 10,000 sites up to 



a level more typical of the internet as a whole. Because we assign 
all servers to the most likely category, even when our confidence 
in our estimates are low, this may result in various obscure server 
versions being classified as Apache with a low confidence rating. 
As the most common server type, Apache becomes the "best guess" 
when no good classification is possible. 

Additionally, the number of servers reporting Microsoft-IIS can be 
seen to have increased. This is consistent with the known existence 
of software designed to obscure Microsoft-IIS servers. More inter- 
esting is the virtual disappearance of Microsoft-HTTPAPI servers. 
The Microsoft HTTP API is designed to let programs written in C 
service HTTP requests for specific URLs. It is a relatively low-level 
API, exposing basic HTTP functionality to the programmer. While 
there is no definitive way of knowing whether our re-classification 
of Microsoft HTTP API version strings was correct or not, it is 
worth noting that Microsoft HTTP API is not so much a server 
type as it is an interface for various programs to function as a web 
server. Thus, it seems likely that there is limited commonality be- 
tween the way different programs making use of this API handled 
our requests. It may be the case that, because each program using 
this API is effectively a unique server type, that our algorithm was 
unable to effectively classify them. 

Of the three smaller server versions, AkamiaGHost, cloudflare- 
nginx, and LiteSpeed, two are not server types at all, but content 
delivery networks. CloudFlare and Akamai specialize in hosting 
static content in locations close to users, enabling faster page re- 
trieval. For this reason, it is not surprising that their servers should 
be re-classified as one of the common types — their server version 
strings represent a delivery network and not a version type. Lite- 
Speed, on the other hand, is a proprietary server technology. While 
it advertises itself as being "completely Apache interchangeable," 
1131 LiteSpeed does not advertise itself as being an Apache deriva- 
tive. The general classification of LiteSpeed servers as Apache may 
reflect similarities in the behavior of the two servers as well as Lite- 
Speed's small market share. 

3.7.5 Apache Versions 

In addition to the server type, the software version is an interesting 
topic of study. It provides information as to how up-to-date a server 
is, an interesting question both for study and for potential attackers. 
For this reason, we set out to see what information we could extract 
from our dataset regarding server versions. 

Most servers in our dataset did not disclose their exact version, but 
a sufficient number provided the information to make some analy- 
sis possible. Because Apache servers possessed the largest market 
share — and hence, the most complete data — we chose to study 
their version strings in order to get a better understanding of the 
server population in active use. Although we have not studied other 
server types in the same detail, we expect these results to be simi- 
lar to Apache derivatives and that the results can be generalized to 
other servers. 

Our first step in studying the distribution of Apache servers was 
simply to examine the version strings returned by the servers them- 
selves. As is shown in Appendix B, the version numbers of Apache 
servers in active use vary widely. While many are recent versions 
such as 2.2.22 and 2.2.23 (both 2012 releases) some servers use 
Apache versions dating from the late 1990s and early 2000s. In- 
deed, more than 10,000 responses of the 963,000 received from 
the top 100,000 sites reported some version of Apache 1.3, which 



Server Type 


Top 100k (Raw) 


Top 10k (Raw) 


Top 10k 

(Corrected) 


Bottom 10k (Raw) 


Apache 


40.7% 


29.6% 


59.6% 


52.8% 


Nginx 


12.8% 


16.1% 


17.6% 


8.2% 


Microsoft-IIS 


12.7% 


9.5% 


12.7% 


12.9% 


Not Reported 


12.3% 


15.5% 


None 


9.1% 


Microsoft-HTTPAPI 


6.3% 


5.2% 


0.1% 


5.3% 


AkamaiGHost 


2.6% 


6.4% 


None 


0.2% 


Cloudfiare-Nginx 


1.5% 


0.9% 


None 


0.9% 


LiteSpeed 


1.5% 


0.8% 


None 


1.2% 



Table 2: Prevalence of server types based on raw version strings and calculated values. 
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Figure 2: Release year for Apache software currently in use 



reached its end of life in 2010, although patches are still produced. 

Server version strings, like server types, are subject to the caprice 
of site maintainers, who have an incentive to report incorrect in- 
formation in an attempt at security by obscurity. Not only did 
slightly more than half of the servers reporting themselves as run- 
ning Apache give no version details at all, but in the remainder, 
a few obviously false version strings such as "4.0.4" and "6.6.6" 
were returned. However, it is unclear how many of the plausible 
version strings are in fact false. 

Using the same probabilistic techniques that we used to predict 
server types, we attempted to predict server versions. First, we 
took all of the servers that reported their server type as Apache, and 
trained our algorithm based on their reported versions from the top 
100k. We then applied the trained algorithm to the smaller data sets, 
seeking to predict the version for servers reporting Apache . Unfor- 
tunately, the responses received from our queries did not meaning- 
fully differentiate Apache versions. It is possible that variations 
among Apache versions might enable a researcher to determine 
Apache versions sending different requests or by measuring dif- 
ferent properties in HTTP responses, but we leave this to future 
work. 

3.1.6 Comparison with others' results 
One way to estimate the accuracy of our results is to compare them 
against other sources. Netcraft gives results for "all domains" — 
more than 625 million domains. They report the top servers as 
Apache with 57.2%, Microsoft with 16.5%, nginx with 1 1.9%, and 
Google with 3.4% 1171 . These results are largely consonant with 
our findings, although the scale of the Netcraft survey is obviously 



much broader, and so the results can not be directly compared. 
In a similar survey of 63.5 million servers, Security Space reports 
Apache with 68.7%, Microsoft with 14.89%, and Other (presum- 
ably including nginx) with 16.17% of the market [18 |. While the 
results are largely consistent, we did not have access to the method- 
ology for these two surveys, so we were unable to determine if they 
were naively relying on server version strings, or using some more 
sophisticated method of identifying server software. 

3.2 Server Responses 

In analyzing our data, we found that web servers returned 44 differ- 
ent response codes, many of which do not form part of the official 
HTTP specification. The table in Appendix A illustrates the range 
of responses for our various requests. Some of these responses 
are particularly noteworthy. For example, when we request text 
and HTML documents as CSS, only a relatively small percentage 
of servers return code 406, indicating that the document can not 
be returned given the parameters sent by the client. Interestingly 
enough, the number of servers returning the response varied de- 
pending on the request. When we asked for a text file as CSS, 
9,338 servers gave us a 406 error, while only 3,229 gave us that 
error when we asked for an HTML document as CSS. Much more 
common is code 200, sending the document despite the unusual re- 
quest. While this response is perfectly compliant with the HTTP 
standard, which provides for the return of documents in a different 
type if the requested type is not available, it may be a less wise re- 
sponse from a security perspective, as the request suggests a cross 
site scripting attack is in progress, and the attacker is looking for 
access to user data from that page. 

Most servers correctly returned an error (400 — bad request, 403 
— not authorized, or 404 — not found) to our request for a relative 
URL. However, the 4,759 servers that returned a 200 code were not 
mostly servers exposing a security vulnerability. Instead, they were 
mostly configured to return a standard page for any malformed re- 
quest - not a strict interpretation of the HTTP standard, but not an 
immediate security risk. For some reason, a 405 (not allowed) re- 
sponse was returned by nearly half the servers for our trace request, 
rounding out the significant anomalies found in the version codes. 



Unsupported Features. A number of the features used in our 
requests seemed simply not to be supported by a majority of web 
servers. For example, our conditional get, which asked for a page 
only if it had been modified after a future date, only received the 
expected 304 response from some 7,000 of the 97,000 responding 
servers. The remainder simply served up the page, even though the 
condition had clearly not been met. The actual 501 "Not Imple- 
mented" response appeared relatively rarely, with 14,000 servers 



returning it for our trace request, and 3,700 returning it for our op- 
tions request. 

3.3 Character Encoding 

Not surprisingly, a wide variety of character sets were returned by 
our servers, with UTF-8 being the most common, followed by ISO 
8859-1 and windows-1251. A small portion of servers were 
obviously misconfigured, returning an empty string or a string such 
as $conf_pass->charset . We did not develop tests to see if 
the charsets for the remaining servers matched the actual encoding 
of the content, but it seems reasonable to assume that additional 
misconfigurations exist in that space, as well. 

More significant is the fact that more than one third of servers 
(36,000 out of our 100,000) simply did not return a character en- 
coding at all. While correct charset encoding is more important to 
user experience than to security, several existing security vulnera- 
bilities relate to incorrect or missing character encoding, indicating 
this is a risk on some deployed servers 11191 . 

3.4 MIME Types 

Also interesting from a security perspective is the range of MIME 
types received. While the responses to our standard GET request 
were mostly the expected text /html, the responses to some our 
other requests had more variety. Responses such as application/ 
java-archive and application/ json for an HTML doc- 
ument hinted at server misconfigurations. Significantly, 0.1% of 
servers returned no MIME type at all, exposing themselves to ex- 
ploits where an attacker crafts a site to cross-load resources as a 
different MIME type, thus extracting data from a confidential doc- 
ument. 

3.5 XSS Vulnerabilities 

One of the more curious MIME type handling behaviors that we 
observed was found in a number of servers that generate the MIME 
type for their response based not on the actual type of the file, but 
on the type requested by the client. Indeed, 208 sites in the top 
100k, mostly reporting Nginx as their server type, exhibited this 
behavior. This is a potential vulnerability, as it opens clients to 
cross site scripting attacks 1101 . 

In a cross origin CSS attack, an HTML page with confidential in- 
formation is loaded by a script running from an attacker's site. If 
certain short CSS sequences have been injected into the page, the 
attacker is able to read portions of the secure page. Most browsers, 
including Chrome, Safari, and Firefox, prevent this by refusing to 
load files as CSS resources cross origin if they present a conflicting 
MIME type. However, when asked for a HTML page as CSS, these 
sites, including sites like kickstarter.com and causes.com which 
handle sensitive data, report that the HTML page is, in fact, CSS, 
potentially enabling this sort of attack against their users. While 
browsers could protect themselves against these attacks by deter- 
mining a document's content type through analysis, this behavior 
represents a significant weakness J2]. 

4. FUTURE WORK 

Our results show that it is possible to gather significant data about a 
web server and its vulnerabilities based on a few carefully selected 
HTTP requests. However, there is much room available for fur- 
ther work in refining this process. In particular, additional work is 
needed to develop a library of requests that are particularly effec- 
tive at gathering distinguishing information about a server and its 



security vulnerabilities. Our requests were designed on the basis 
of informed judgment, but systematic tests with a larger collection 
of possible requests would likely result in a collection that allowed 
greater accuracy in predicting server type and in detecting vulnera- 
bilities. 

Additionally, our analysis focused primarily on the response codes 
and MIME types returned from the server. Further research could 
investigate other details of the server's response, perhaps finding 
variations that enable better classification and analysis of the servers 
studied. Furthermore, ongoing research could survey a broader se- 
lection of servers, both less visited servers hosting sites for the gen- 
eral public and the embedded servers found in so many devices. 

While our work focused on web servers, and much similar work 
has been done on general network traffic, there are a broad range 
of additional Internet technologies where probabilistic techniques 
could be used to yield a better understanding of software distribu- 
tion, maintenance, and behavior. This presents many interesting 
areas for ongoing study. 

5. CONCLUSIONS 

We have demonstrated that it is possible to predict server type based 
on server responses, even without a prior analysis of server be- 
havior. This enables a more accurate classification of web server 
types than would be possible using server version strings alone. It 
also shows that attempts at security through obscurity by hiding 
the server version do not provide effective protection to vulnera- 
ble web servers. While we were not able to effectively calculate 
server versions using the technique of analyzing response codes to 
individual requests, there is no fundamental reason why similar, 
but more precise, techniques might not also be able to differentiate 
among versions. 

Additionally, even our broad survey, not focused on any particu- 
lar site or vulnerability, was able to reveal a number of potentially 
serious configuration issues. This is despite the fact that we fo- 
cused only on the most visited websites, which would presumably 
be among the best maintained. Among the misconfigurations ob- 
served were improper MIME type configurations that allow a doc- 
ument to be served with no MIME type or with an incorrect type 
based on the client's request. Dated server software was also found 
to be widely in use. The maintenance of a web site is, of course, 
the responsibility of the site's owner, but we hope that our research 
may provide a minor stimulus to raising the profile of the dangers 
caused by mis-configured server software. 
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Table[8]gives a listing of all of the response strings returned by the 
top 100,000 servers to our various requests. As can be seen, most 
of the responses cluster into a few expected categories, but there 
are large numbers of outliers, including many version codes not 
defined in the official standard. Even some official codes occurred 
in surprising places. For example, the IETF defines response code 
418 as being an appropriate response from a teapot when instructed 
to brew coffee 11141 . As none of our requests were related to the 
preparation of hot beverages, it is safe to assume that this code was 
being used outside of its defined scope. 



8. APPENDIX A: RESPONSES FROM SERVERS 



Response 

/ ",,,|,, 

code 


Condit- 
ional 
Get 


Get 
Favlcon 


Get 
Relative 
URL 


Get 


Head 


Get 
HTML 

as CSS 


Options 


Partial 
ijet 


Trace 


Get text 
as C5>5> 























1 


1 




















1 


200 


88,689 


73,599 


4,759 


95,654 


94,059 


91,877 


80,192 


95,653 


27,728 


65,003 


203 










3 












204 


3 


79 


6 


3 


11 


3 


10 


3 


2 


4 


205 




1 


3 








4 






2 


300 




3 
















2 


301 


2 


1 


1 


2 


4 


2 


3 


2 


3 


1 


302 


1 


3 




1 


1 


3 




1 




3 


304 


6,985 


1 














1 




400 


69 


72 


66,759 


70 


148 


70 


385 


69 


289 


66 


401 


26 


26 


11 


26 


28 


24 


271 


26 


200 


25 


402 














2 








403 


411 


268 


16,331 


410 


952 


407 


1,676 


409 


9,492 


405 


404 


357 


22,455 


6,616 


356 


540 


417 


1,221 


359 


1,508 


21,547 


405 






3 




243 




7,768 




40,968 


1 


406 


1 


5 


6 


1 


3 


3,229 


39 


1 


45 


9,338 


407 














2 




2 




408 


2 


2 


3 


2 


2 


3 


2 


2 


2 


2 


409 


1 


1 




1 


1 


1 


1 


1 




1 


410 


1 


20 


34 


1 


1 


1 


1 


2 


3 


10 


411 














21 








412 














1 








413 


















17 




417 






25 
















418 






1 
















420 














1 








422 














11 








429 


1 






1 


1 


1 










440 










1 












499 






1 
















500 


119 


146 


832 


124 


253 


533 


335 


122 


196 


193 


501 




1 


8 




31 


2 


3,663 




14,002 




502 


53 


34 


11 


53 


63 


56 


87 


52 


12 


44 


503 


88 


57 


85 


87 


192 


94 


158 


85 


76 


59 


504 


10 


7 


11 


12 


11 


9 


10 


13 


7 


6 


508 


2 


1 


2 


1 


1 


4 


1 


1 


1 


1 


509 


1 


1 


6 


1 


1 


1 


1 


1 


1 


1 


550 










1 








1 




599 










1 












770 


1 


1 


1 


1 


1 


1 


1 


1 




1 


801 






1 








1 




1 




901 




















1 


999 














6 




29 




Totals 


96,823 


96,784 


95,516 


96,807 


96,553 


96,738 


95,874 


96,803 


94,586 


96,718 



Figure 3: Server response codes from top 100,000 servers by request 



9. APPENDIX B: APACHE VERSIONS 

The following versions of Apache were reported by servers in the 
Alexa top 100k. A string such as 2.X.X indicates that the server 
only reported Apache 2, and so on. 



Apache Version 


No. of Samples 
Reported 


Release Date 


Apache/ 1.3. 3 


9 


9-Oct-98 


Apache/1.3.9 


2 


19-Aug-99 


Apache/1.3.11 


5 


21-Jan-00 


Apache/1.3.19 


8 


l-Mar-01 


Apache/1.3.20 


40 


21-May-01 


Apache/1.3.22 


10 


12-Oct-01 


Apache/1.3.23 


26 


21-Jan-02 


Apache/1.3.24 


27 


22-Mar-02 


Apache/1.3.26 


177 


18-Jun-02 


Apache/1.3.27 


265 


3-Oct-02 


Apache/1.3.28 


40 


16-M-03 


Apache/1.3.29 


264 


29-Oct-03 


Apache/1.3.31 


180 


ll-May-04 


Apache/1.3.32 


8 


Not released 


Apache/ 1.3. 3 3 


559 


29-Oct-04 


Apache/1.3.34 


509 


18-Oct-05 


Apache/1.3.35 


19 


l-May-06 


Apache/1.3.36 


77 


17-May-06 


Apache/1.3.37 


1,223 


28-M-06 


Apache/1.3.39 


271 


7-Sep-07 


Apache/1.3.41 


2,964 


19-Jan-08 


Apache/1.3.42 


3,618 


2-Feb-08 [EOL] 


Apache/1.4.0 


10 


n/a 


Apache/1.4.X 


100 


n/a 


Apache/1.9.0 


1 


n/a 


Apache/2.0.4 


10 


n/a 


Apache/2.0.6 


10 


n/a 


Apache/2.0.29 


10 


Not released 


Apache/2.0.35 


7 


5-Apr-02 


Apache/2.0.40 


96 


9-Aug-02 


Apache/2.0.43 


10 


3-Oct-02 


Apache/2.0.44 


10 


20-Jan-03 


Apache/2.0.45 


10 


l-Apr-03 


Apache/2.0.46 


270 


28-May-03 


Apache/2.0.47 


10 


9-Jul-03 


Apache/2.0.48 


20 


29-Oct-03 


Apache/2.0.49 


61 


1 9-Mar-04 


Apache/2.0.50 


70 


30-Jun-04 


Apache/2.0.51 


144 


i j-oep-u^f 


Apache/2.0.52 


3,107 


28-Sep-04 


Apache/2.0.53 


132 


7-Feb-05 


Apache/2.0.54 


450 


17-Apr-05 


Apache/2.0.55 


348 


16-Oct-05 


Apache/2.0.58 


122 


l-May-06 


Apache/2.0.59 


1,226 


28-Jul-06 


Apache/2.0.61 


204 


7-Sep-07 


Apache/2.0.63 


4,858 


19-Jan-08 


Apache/2.0.64 


3,804 


19-Oct-10 


Apache/2.0.X 


133 


on or after March 


10, 2000 


Apache/2. l.X 


20 


before December 1, 


2005 



Apache Version 


No. of Samples 
Reported 


Release Date 


Apache/2.2.0 


892 


l-Dec-05 


Apache/2.2.1 


10 


Not released 


Apache/2.2.2 


214 


l-May-06 


Apache/2.2.3 


41,012 


28-Jul-06 


Apache/2.2.4 


927 


9-Jan-07 


Apache/2.2.6 


1008 


7-Sep-07 


Apache/2.2.8 


3,443 


19-Jan-08 


Apache/2.2.9 


8,635 


14-Jun-08 


Apache/2.2.10 


1,401 


14-Oct-08 


Apache/2.2.11 


3,270 


14-Dec-08 


Apache/2.2.12 


1,126 


28-M-09 


Apache/2.2.13 


819 


8-Aug-09 


Apache/2.2.14 


12,018 


3-Oct-09 


Apache/2.2.15 


15,230 


5-Mar-10 


Apache/2.2.16 


18,947 


25-Jul-10 


Apache/2.2.17 


12,667 


18-Oct-10 


Apache/2.2.18 


299 


11-May-ll 


Apache/2.2.19 


4,108 


21 -May- 11 


Apache/2.2.20 


4,010 


30- Aug- 11 


Apache/2.2.21 


11,910 


13-Sep-ll 


Apache/2.2.22 


39,446 


31 -Jan- 12 


Apache/2.2.23 


13,660 


13-Sep-12 


Apache/2. 2.X 


6,071 


on or after Decem- 
ber 1, 2005 


Apache/2.3.5 


20 


26- Jan- 10 


Apache/2.3.6 


10 


17-Jun-10 


Apache/2.3.8 


28 


31-Aug-10 


Apache/2.3.11 


7 


7-Mar-ll 


Apache/2.3.14 


7 


9- Aug- 11 


Apache/2.3.16 


6 


20-Dec-ll 


Apache/2.4.0 


8 




Apache/2.4.1 


182 


17-Feb-12 


Apache/2.4.2 


257 


17-Apr-12 


Apache/2.4.3 


506 


21-Aug-12 


Apache/2.4.X 


40 


2012 


Apache/2.X.X 


10,891 


on or after March 


10, 2000 


# of samples re- 
porting version: 


238,508 





